Brief Overview 1

Column

In this session, we will use the Black Friday Data available in Kaggle to study how to make the following graphical displays.

Column

Graphical Displays

  • Categorical Data
    • Bar Chart
    • Pie Chart
  • Quantitative Data
    • Histogram
    • Boxplot
    • Scatterplot
    • Line

Common Arguments:

Here is a list of common arguments:

  • col: a vector of colors
  • main: title for the plot
  • xlim or ylim: limits for the x or y axis
  • xlab or ylab: a label for the x or y axis
  • font: font used for text: 1 = plain, 2 = bold, 3 = italic, 4 = bold italic
  • font.axis: font used for axis
  • cex.axis: font size for x and y axes
  • font.lab: font for x and y labels
  • cex.lab: font size for x and y labels

Brief Overview 2

Row

In this session, we will use the Black Friday Data available in Kaggle to study how to make the following graphical displays.

Row

Graphical Displays

  • Categorical Data
    • Bar Chart
    • Pie Chart
  • Quantitative Data
    • Histogram
    • Boxplot
    • Scatterplot
    • Line

Common Arguments:

Here is a list of common arguments:

  • col: a vector of colors
  • main: title for the plot
  • xlim or ylim: limits for the x or y axis
  • xlab or ylab: a label for the x or y axis
  • font: font used for text: 1 = plain, 2 = bold, 3 = italic, 4 = bold italic
  • font.axis: font used for axis
  • cex.axis: font size for x and y axes
  • font.lab: font for x and y labels
  • cex.lab: font size for x and y labels

Data

Column

First 500 Observations

Column

Description

To understand customer purchase behavior across products in different categories, the retail company “ABC Private Limited” in the United Kingdom shared a purchase summary of various customers for selected high-volume products from the last month. The data contain the following variables.

  • User_ID: User ID
  • Product_ID: Product ID
  • Gender: Sex of User
  • Age: Age in bins
  • Occupation: Occupation (Masked)
  • City_Category: Category of the City (A,B,C)
  • Stay_In_Current_City_Years: Number of years of stay in the current city
  • Marital_Status: Marital Status
  • Product_Category_1: Product Category (Masked)
  • Product_Category_2: Product may also belong to another category (Masked)
  • Product_Category_3: Product may also belong to another category (Masked)
  • Purchase: Purchase Amount
Rows: 550,068
Columns: 12
$ User_ID                    <dbl> 1000001, 1000001, 1000001, 1000001, 1000002…
$ Product_ID                 <chr> "P00069042", "P00248942", "P00087842", "P00…
$ Gender                     <chr> "F", "F", "F", "F", "M", "M", "M", "M", "M"…
$ Age                        <chr> "0-17", "0-17", "0-17", "0-17", "55+", "26-…
$ Occupation                 <dbl> 10, 10, 10, 10, 16, 15, 7, 7, 7, 20, 20, 20…
$ City_Category              <chr> "A", "A", "A", "A", "C", "A", "B", "B", "B"…
$ Stay_In_Current_City_Years <chr> "2", "2", "2", "2", "4+", "3", "2", "2", "2…
$ Marital_Status             <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0…
$ Product_Category_1         <dbl> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5, 8, 8, 1,…
$ Product_Category_2         <dbl> NA, 6, NA, 14, NA, 2, 8, 15, 16, NA, 11, NA…
$ Product_Category_3         <dbl> NA, 14, NA, NA, NA, NA, 17, NA, NA, NA, NA,…
$ Purchase                   <dbl> 8370, 15200, 1422, 1057, 7969, 15227, 19215…

Bar Chart

Row

A bar chart is a graphical display well suited to a general audience. Here, we study the age-group distribution of the company’s customers who made purchases on Black Friday.

Usage: barplot(height, …)

A bar chart can be horizontal or vertical. Using the argument col, we can assign a color to the bars. The argument main sets the title of the figure. Colors can also be given as RGB color codes.

Note: The margins of a figure can be set using the par() function through its mar argument. The order of the settings is c(bottom, left, top, right).

Analysis

The horizontal bar chart shows the age-group distribution of the company’s customers who purchased products on Black Friday. The distribution appears right-skewed, with a tail toward the older age groups: the younger groups account for more of the purchases.

Row

Vertical Bar Chart

Horizontal Bar Chart

Pie Chart

Column

Similarly, we can use a pie chart to study the distribution of the city category.

Usage: pie(x, …)

Tip: Use a color palette to choose colors (Google search: color scheme generator).

Analysis

This pie chart shows the three city categories, but it conveys little on its own because the slices are labeled only with percentages, not with the category names.

Column

Distribution of City Category

Histogram

Column

A histogram is used when we want to study the distribution of a quantitative variable. Here, we study the distribution of customer purchase amount.

Usage: hist(x, …)

Column

Analysis

The histogram appears right-skewed and multimodal, with several peaks. Most purchases appear to fall between about 5,000 and 10,000.

Boxplot

Column

Boxplot 1

Here, we talk about another graphical display that can be used to study the distribution of a quantitative variable: box and whisker plot (boxplot).

Usage: boxplot(x, …) or boxplot(formula, …)

Boxplot 2

In general, a boxplot is useful when we want to compare the distribution of a quantitative variable across several groups. In the following, we study the distribution of customer purchase amount by sex and marital status.

Column

Analysis of Boxplot 1

The boxplot is right-skewed, as the upper whisker is longer than the lower whisker. There are also some outliers at the upper end.

Analysis of Boxplot 2

The distribution of purchase amount does not differ much by sex and marital status. There are slight differences, but all four groups appear right-skewed, with some outliers at the upper end.

Scatterplot

Column

When we want to study the relationship between two quantitative variables, a scatterplot can be used. Since this data set doesn’t have another quantitative variable, we will use the built-in data mtcars in R and study the relationship between miles per gallon and vehicle weight.

Column

Analysis

The graph shows a moderately strong negative linear relationship between the two variables: the heavier the vehicle, the fewer miles per gallon it gets.

Line Plot

Column

Data

Since the Black Friday Data are not time series data, a line plot is not appropriate for them. In the following code chunk, we create a data frame using the forecasted highest temperatures from July 13 to July 22, 2022 (The Weather Channel).

Analysis

The chart has four lines, one per city, showing each city’s forecast high temperature. Fargo has the lowest highs on most days, and Houston has the highest at the start of the period. However, Houston’s highs decline steadily as the days go by, while the other three cities fluctuate, alternating between increases and decreases.

Column

Line Chart

---
title: "Basic Graphical Displays"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: default
      navbar-bg: "purple"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(DT)
library(plotly)
Friday <- read_csv("./Black_Friday.csv")
```

Brief Overview 1
===

Column {data-width=450}
---

In this session, we will use the Black Friday Data available in [Kaggle](https://www.kaggle.com/datasets/pranavuikey/black-friday-sales-eda) to study how to make the following graphical displays.

Column {.tabset data-width=550}
---

### Graphical Displays
- Categorical Data
  - Bar Chart
  - Pie Chart
  
- Quantitative Data
  - Histogram
  - Boxplot
  - Scatterplot
  - Line

### Common Arguments:
Here is a list of common arguments:

  - col: a vector of colors
  - main: title for the plot
  - xlim or ylim: limits for the x or y axis
  - xlab or ylab: a label for the x or y axis
  - font: font used for text: 1 = plain, 2 = bold, 3 = italic, 4 = bold italic
  - font.axis: font used for axis
  - cex.axis: font size for x and y axes
  - font.lab: font for x and y labels
  - cex.lab: font size for x and y labels
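
As a minimal sketch of how these arguments combine in a single call, the example below draws a scatterplot of made-up values. The vectors `x` and `y` are illustrative assumptions and are not part of the Black Friday data.

```r
# Hypothetical data, used only to illustrate the arguments listed above
x <- 1:10
y <- c(3, 5, 4, 6, 8, 7, 9, 11, 10, 12)

plot(x, y,
     col = "steelblue",          # color of the points
     main = "Common Arguments",  # plot title
     xlim = c(0, 12),            # limits of the x axis
     ylim = c(0, 15),            # limits of the y axis
     xlab = "x values",          # label for the x axis
     ylab = "y values",          # label for the y axis
     font.axis = 3,              # italic tick labels
     cex.axis = 0.8,             # tick labels at 80% of the default size
     font.lab = 2,               # bold axis labels
     cex.lab = 1.2)              # axis labels at 120% of the default size
```

The `cex.*` arguments are multipliers relative to the default text size, so values below 1 shrink the text and values above 1 enlarge it.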

Brief Overview 2 {data-orientation=rows}
===

Row {data-height=100}
---
In this session, we will use the Black Friday Data available in [Kaggle](https://www.kaggle.com/datasets/pranavuikey/black-friday-sales-eda) to study how to make the following graphical displays.

Row {data-height=900}
---
### Graphical Displays
- Categorical Data
  - Bar Chart
  - Pie Chart
  
- Quantitative Data
  - Histogram
  - Boxplot
  - Scatterplot
  - Line

### Common Arguments:
Here is a list of common arguments:

  - col: a vector of colors
  - main: title for the plot
  - xlim or ylim: limits for the x or y axis
  - xlab or ylab: a label for the x or y axis
  - font: font used for text: 1 = plain, 2 = bold, 3 = italic, 4 = bold italic
  - font.axis: font used for axis
  - cex.axis: font size for x and y axes
  - font.lab: font for x and y labels
  - cex.lab: font size for x and y labels
  
  
Data
===
  
Column {data-width=550}
---
  
### <b><font size = 4><span Style = "color:blue">First 500 Observations</span></font></b>
  
```{r show_table}
datatable(Friday[1:500,], rownames = FALSE, colnames= c("User ID", "Product ID", "Gender", "Age", "Occupation", "City Category", "Stay In Current City Years", "Marital Status", "Product Category 1", "Product Category 2", "Product Category 3", "Purchase"), options = list(pageLength = 20))
```
  

Column {data-width=450}
---

### <font size = 4><span Style = "color:red">Description</span></font>

To understand customer purchase behavior across products in different categories, the retail company "ABC Private Limited" in the United Kingdom shared a purchase summary of various customers for selected high-volume products from the last month. The data contain the following variables.

- User_ID: User ID
- Product_ID: Product ID
- Gender: Sex of User
- Age: Age in bins
- Occupation: Occupation (Masked)
- City_Category: Category of the City (A,B,C)
- Stay_In_Current_City_Years: Number of years of stay in the current city
- Marital_Status: Marital Status
- Product_Category_1: Product Category (Masked)
- Product_Category_2: Product may also belong to another category (Masked)
- Product_Category_3: Product may also belong to another category (Masked)
- Purchase: Purchase Amount

```{r glimpse}
glimpse(Friday)
```


Bar Chart {data-orientation=rows}
===

Row {data-height=350}
---

###
A bar chart is a graphical display well suited to a general audience. Here, we study the age-group distribution of the company's customers who made purchases on Black Friday.

**Usage:** barplot(height, ...)

A bar chart can be horizontal or vertical. Using the argument <span Style="color:orange">col</span>, we can assign a color to the bars. The argument <span Style="color:orange">main</span> sets the title of the figure. Colors can also be given as RGB color codes.

**Note:** The margins of a figure can be set using the <span Style="color:blue">par()</span> function through its mar argument. The order of the settings is <span Style="color:orange">c(bottom, left, top, right)</span>.
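
The margin and color settings described here can be sketched on a small example. The `counts` vector is a hypothetical stand-in for `table(Friday$Age)`, and saving the result of `par()` so the settings can be restored is a common convention rather than something the dashboard itself does.

```r
# Hypothetical counts, used only for illustration
counts <- c(A = 12, B = 30, C = 18)

# par() returns the previous settings, so they can be restored afterwards
old_par <- par(mar = c(5, 7, 4, 2))  # c(bottom, left, top, right), in lines of text

# Build a color from red, green and blue components on a 0-255 scale
bar_col <- rgb(105, 179, 162, maxColorValue = 255)

barplot(counts, col = bar_col, main = "Illustrative Bar Chart",
        horiz = TRUE, las = 1, xlab = "Count")

par(old_par)  # restore the previous margin settings
```

Here `rgb()` produces the same color as the hex string `"#69B3A2"`, so either form can be passed to `col`.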

### Analysis
The horizontal bar chart shows the age-group distribution of the company’s customers who purchased products on Black Friday. The distribution appears right-skewed, with a tail toward the older age groups: the younger groups account for more of the purchases.

Row {data-height=650}
---

### **Vertical Bar Chart**
```{r bar1}
par(mgp=c(4,1,0)) # change the margin line for the axis title, axis labels and axis line
par(mar=c(5,7,4,2)) # set margin of the figure
barplot(table(Friday$Age), col = "lightblue", main = "Distribution of Purchases by Customer's Age", xlab = "Age Group", 
        ylab = "Number of Purchases", las = 1) # bars are vertical by default (horiz = FALSE)
```


### **Horizontal Bar Chart**
```{r}
# par() settings only affect base graphics, so none are needed for ggplot2
Friday%>%
  ggplot(aes(x = Age)) +
  geom_bar(fill="#69b3a2") +
  coord_flip() +
  labs(title = "Distribution of Purchases by Customer's Age",
       x = "Age Groups",
       y = "Number of Purchases") -> bar1
ggplotly(bar1)
```

Pie Chart
===

Column {data-width=500}
---

Similarly, we can use a pie chart to study the distribution of the city category.

**Usage:** pie(x, ...)

**Tip:** Use a color palette to choose colors (Google search: color scheme generator).

### Analysis
This pie chart shows the three city categories, but it conveys little on its own because the slices are labeled only with percentages, not with the category names.
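
One possible way to make each slice self-explanatory is to put the category name and its percentage in the same label. The sketch below uses hypothetical counts in place of `table(Friday$City_Category)`; the numbers are made up for illustration.

```r
# Hypothetical counts for three city categories (not the real data)
H <- c(A = 150, B = 230, C = 170)

percent <- round(100 * H / sum(H), 1)                # percentage per category
pie_labels <- paste0(names(H), " (", percent, "%)")  # category name plus share

pie(H, labels = pie_labels, main = "Distribution of City Category",
    col = c("#54d2d2", "#fb6f92", "#f8aa4b"))
```

Because the labels carry the category names, a separate legend is optional with this approach.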

Column {data-width=500}
---

### Distribution of City Category
```{r pie}
H <- table(Friday$City_Category)
percent <- round(100*H/sum(H), 1) #calculate percentages
pie_labels <- paste(percent, "%", sep="") # include %
pie(H, main = "Distribution of City Category", labels = pie_labels, col = c("#54d2d2", "#fb6f92", "#f8aa4b"))
legend("topright", c("A", "B", "C"), cex = 0.8, fill = c("#54d2d2", "#fb6f92", "#f8aa4b")) # same colors as the slices
```

Histogram
===

Column {data-width=500}
---

###
A histogram is used when we want to study the distribution of a quantitative variable. Here, we study the distribution of customer purchase amount.

**Usage:** hist(x, ...)

```{r histogram}
Friday %>% ggplot(aes(x = Purchase)) +
  geom_histogram(bins = 30, fill = "blue") + # 30 bins is the ggplot2 default, set explicitly
  labs(title = "Distribution of Customer Purchase Amount",
       x = "Purchase Amount (British Pounds)")
```
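
The chunk above uses ggplot2, while the usage line refers to the base R function `hist()`. A minimal base R sketch is shown below. It uses simulated right-skewed values as a stand-in for `Friday$Purchase`, so the shape of the plot is illustrative only.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated right-skewed amounts; a stand-in for Friday$Purchase
purchase <- rgamma(1000, shape = 2, scale = 4000)

h <- hist(purchase,
          breaks = 30,  # suggested number of bins; hist() may adjust it
          col = "blue",
          main = "Distribution of Customer Purchase Amount",
          xlab = "Purchase Amount")

sum(h$counts)  # every observation falls in exactly one bin
```

`hist()` returns its bin boundaries and counts invisibly, which is handy for checking how the bars were built.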

Column {data-width=500}
---

### Analysis
The histogram appears right-skewed and multimodal, with several peaks. Most purchases appear to fall between about 5,000 and 10,000.

Boxplot
===

Column {.tabset data-width=550}
---

### Boxplot 1

Here, we talk about another graphical display that can be used to study the distribution of a quantitative variable: box and whisker plot (boxplot).


**Usage:** boxplot(x, ...) or boxplot(formula, ...)

```{r boxplot1}
boxplot(Friday$Purchase, xlab = "Purchase Amount", ylab = "British Pounds")
```


### Boxplot 2

In general, a boxplot is useful when we want to compare the distribution of a quantitative variable across several groups. In the following, we study the distribution of customer purchase amount by sex and marital status.

```{r}
boxplot(Purchase ~ Gender + Marital_Status, data = Friday, main = "Distribution of Purchase by Sex and Marital Status", 
        xlab = "Sex and Marital Status", ylab = "Purchase", cex.lab = 0.75, cex.axis = 0.5,
        names = c("Female & Single", "Male & Single", "Female & Married", "Male & Married"))
```
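
The `names` vector in the chunk above must follow the order in which `boxplot()` arranges the groups. A small sketch of how that order can be checked is given below. The `toy` data frame is hypothetical, and it assumes, as the labels in the chunk imply, that `Marital_Status` 0 means single and 1 means married.

```r
# Small hypothetical sample with the same coding as the data
toy <- data.frame(
  Gender = c("F", "M", "F", "M", "F", "M"),
  Marital_Status = c(0, 0, 1, 1, 0, 1),
  Purchase = c(8000, 9500, 7000, 12000, 6500, 10000)
)

# With plot = FALSE, boxplot() only returns the statistics it would draw
b <- boxplot(Purchase ~ Gender + Marital_Status, data = toy, plot = FALSE)
b$names  # "F.0" "M.0" "F.1" "M.1"
```

The first variable in the formula varies fastest, which is why the labels run Female & Single, Male & Single, Female & Married, Male & Married.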

Column {data-width=450}
---

### Analysis of Boxplot 1
The boxplot is right-skewed, as the upper whisker is longer than the lower whisker. There are also some outliers at the upper end.
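
The whiskers and outliers that a boxplot draws can also be inspected numerically with `boxplot.stats()`. The sketch below uses a short hypothetical vector rather than `Friday$Purchase`, so the values are for illustration only.

```r
# Hypothetical purchase amounts with one unusually large value
x <- c(1200, 3500, 5400, 6100, 7800, 8100, 9300, 12000, 15500, 40000)

s <- boxplot.stats(x)  # uses the default coef = 1.5
s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # points drawn individually as outliers
```

By default, a point is flagged as an outlier when it lies more than 1.5 times the box length beyond either hinge, and each whisker ends at the most extreme value that is not flagged.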

### Analysis of Boxplot 2
The distribution of purchase amount does not differ much by sex and marital status. There are slight differences, but all four groups appear right-skewed, with some outliers at the upper end.

Scatterplot
===

Column {data-width=500}
---

###
When we want to study the relationship between two quantitative variables, a scatterplot can be used. Since this data set doesn't have another quantitative variable, we will use the built-in data <span class="orange">mtcars</span> in R and study the relationship between miles per gallon and vehicle weight.

```{r}
plot(mpg ~ wt, data = mtcars, 
     xlab = "Weight (1000 lbs)", 
     ylab = "Miles per Gallon", pch = 19, col = "blue")
```


Column {data-width=500}
---

### Analysis
The graph shows a moderately strong negative linear relationship between the two variables: the heavier the vehicle, the fewer miles per gallon it gets.
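
The direction and strength of the relationship can also be checked numerically. The sketch below computes the Pearson correlation and overlays a least-squares line on the same scatterplot. Adding a fitted line is an illustration, not part of the original dashboard.

```r
r <- cor(mtcars$wt, mtcars$mpg)     # Pearson correlation coefficient
fit <- lm(mpg ~ wt, data = mtcars)  # least-squares line

plot(mpg ~ wt, data = mtcars, pch = 19, col = "blue",
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")
abline(fit, col = "red")            # overlay the fitted line

round(r, 2)  # about -0.87
coef(fit)    # intercept and slope; the slope is negative
```

A correlation close to -1 and a negative slope both point to the same pattern seen in the plot.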

Line Plot
===

Column {.tabset data-width=350}
---

### Data
Since the Black Friday Data are not time series data, a line plot is not appropriate for them. In the following code chunk, we create a data frame using the forecasted highest temperatures from July 13 to July 22, 2022 ([The Weather Channel](https://weather.com/)).

```{r data}
Date <- 13:22
Dayton_OH <- c(84, 86, 91, 89, 89, 91, 92, 91, 91, 91)
Houston_TX <- c(100, 97, 96, 94, 94, 94, 93, 93, 92, 91)
Denver_CO <- c(95, 85, 89, 96, 97, 96, 92, 91, 95, 96)
Fargo_ND <- c(86, 80, 84, 87, 90, 87, 83, 84, 87, 89)
df <- data.frame(Date, Dayton_OH, Houston_TX, Denver_CO, Fargo_ND)
datatable(df, rownames = FALSE, colnames = c("Date", "Dayton, OH", "Houston, TX", "Denver, CO", "Fargo, ND"))
```

### Analysis
The chart has four lines, one per city, showing each city's forecast high temperature. Fargo has the lowest highs on most days, and Houston has the highest at the start of the period. However, Houston's highs decline steadily as the days go by, while the other three cities fluctuate, alternating between increases and decreases.
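
Statements like these can be checked against the forecast values directly. The sketch below repeats the vectors from the data chunk so that it runs on its own. The matrix name `temps` and the helper vector `lowest` are introduced here for illustration.

```r
Date <- 13:22
Dayton_OH <- c(84, 86, 91, 89, 89, 91, 92, 91, 91, 91)
Houston_TX <- c(100, 97, 96, 94, 94, 94, 93, 93, 92, 91)
Denver_CO <- c(95, 85, 89, 96, 97, 96, 92, 91, 95, 96)
Fargo_ND <- c(86, 80, 84, 87, 90, 87, 83, 84, 87, 89)
temps <- cbind(Dayton_OH, Houston_TX, Denver_CO, Fargo_ND)

# City with the lowest forecast high on each date
lowest <- colnames(temps)[apply(temps, 1, which.min)]
table(lowest)

# Houston never warms from one day to the next if all differences are <= 0
all(diff(Houston_TX) <= 0)
```

With these values, Fargo has the lowest high on 8 of the 10 dates, and Houston's forecast high never increases from one day to the next.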

Column {data-width=650}
---

### Line Chart

```{r line1}
plot(Date, Dayton_OH, type = "o", col = "blue",
     xlab = "Date in July",
     ylab = "Highest Temperature",
     ylim = c(80, 100))
lines(Date, Houston_TX, type = "o", col = "red")
lines(Date, Denver_CO, type = "o", col = "purple")
lines(Date, Fargo_ND, type = "o", col = "darkgreen")
# Add a Legend
legend("topright", # Position of the Legend
       legend = c("Dayton, OH", "Houston, TX",
                  "Denver, CO", "Fargo, ND"), # Labels
       col = c("blue", "red", "purple", "darkgreen"), # Colors
       lty = 1, # Line types
       pch = 1) # Point types
```